
In [1]:
from IPython.display import HTML
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import plotly.express as px

HTML('''<script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js "></script><script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
} else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);</script><form action="javascript:code_toggle()"><input type="submit" value="Toggle on/off for raw code"></form>
''')
Out[1]:
In [2]:
%%HTML
<script src="require.js"></script>

INTRODUCTION¶

This study aims to investigate the efficacy of regression algorithms, including Linear Regression, RandomForest, and ARIMA (AutoRegressive Integrated Moving Average), in predicting trends and volatility in cryptocurrencies, specifically Bitcoin and Ethereum. The motivation behind this exploration stems from the increasing interest in cryptocurrency markets and the need for accurate predictive models to assist investors and traders in decision-making.

PROBLEM STATEMENT¶

Cryptocurrency has become incredibly popular and easy to get into, attracting a wide range of people interested in investing and trading. Its decentralized nature and low entry barriers have made it accessible to almost anyone, promising high returns in exchange for equally high risk. Despite this widespread popularity, however, accurately predicting cryptocurrency trends remains challenging.

For investors and traders to make informed decisions and manage risks in the cryptocurrency market, accurate forecasting is crucial. This is where predictive models come into play. By using simple regression algorithms like Linear Regression, RandomForest, and time-series analysis methods like ARIMA, we can better understand the trends of cryptocurrencies.


HIGHLIGHTS¶

  • In machine learning, the complexity of a model doesn't guarantee better performance. Simple models like linear regression should be carefully considered as they can outperform more intricate ones.

  • Bitcoin and Ethereum are characterized by significant price fluctuations (volatility), where recent price movements notably impact the next price. Over time, the impact of past patterns decreases, suggesting that seasonal or cyclical trends don't persist.

  • While a model may display a low Mean Absolute Error (MAE), focusing solely on this metric may not capture its practical effectiveness. Even with a low MAE, a model can fail to predict the volatile price swings that matter most when investing in crypto.

In [3]:
# IMPORT LIBRARIES
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns 

from statsmodels.tsa.arima.model import ARIMA
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from math import sqrt
from itertools import product

import time

import warnings
warnings.filterwarnings("ignore")

DATA¶

The dataset was sourced from Kaggle: Bitcoin & Ethereum prices (2014-2024), compiled from various open trading market sources.

Description

It includes daily prices of Bitcoin (BTC) and Ethereum (ETH) spanning from 2014-09-18 to 2024-01-21 and 2017-11-10 to 2024-01-21 respectively.

Columns

  • Date: The recorded date.
  • Open: Opening price of the cryptocurrency.
  • High: Highest price during the trading period.
  • Low: Lowest price during the trading period.
  • Close: Closing price of the cryptocurrency.
  • Adj Close: Adjusted closing price.
  • Volume: Trading volume of the cryptocurrency.

BITCOIN DATA¶

In [4]:
btc_df = pd.read_csv('BTC-USD (2014-2024).csv')
btc_df.set_index('Date', inplace=True)
btc_df.head(10)
Out[4]:
Open High Low Close Adj Close Volume
Date
2014-09-18 456.859985 456.859985 413.104004 424.440002 424.440002 34483200.0
2014-09-19 424.102997 427.834991 384.532013 394.795990 394.795990 37919700.0
2014-09-20 394.673004 423.295990 389.882996 408.903992 408.903992 36863600.0
2014-09-21 408.084991 412.425995 393.181000 398.821014 398.821014 26580100.0
2014-09-22 399.100006 406.915985 397.130005 402.152008 402.152008 24127600.0
2014-09-23 402.092010 441.557007 396.196991 435.790985 435.790985 45099500.0
2014-09-24 435.751007 436.112000 421.131989 423.204987 423.204987 30627700.0
2014-09-25 423.156006 423.519989 409.467987 411.574005 411.574005 26814400.0
2014-09-26 411.428986 414.937988 400.009003 404.424988 404.424988 21460800.0
2014-09-27 403.556000 406.622986 397.372009 399.519989 399.519989 15029300.0
In [5]:
btc_df.shape
Out[5]:
(3413, 6)
In [6]:
btc_df.describe()
Out[6]:
Open High Low Close Adj Close Volume
count 3412.000000 3412.000000 3412.000000 3412.000000 3412.000000 3.412000e+03
mean 14747.360368 15091.809098 14376.126435 14758.111980 14758.111980 1.663026e+10
std 16293.633702 16683.948248 15855.901350 16295.374063 16295.374063 1.907607e+10
min 176.897003 211.731003 171.509995 178.102997 178.102997 5.914570e+06
25% 921.790009 935.210266 908.876495 921.739258 921.739258 1.685530e+08
50% 8288.819824 8464.720703 8108.011475 8285.438965 8285.438965 1.176004e+10
75% 24345.831543 24986.300293 23907.724610 24382.675293 24382.675293 2.697648e+10
max 67549.734375 68789.625000 66382.062500 67566.828125 67566.828125 3.509679e+11
  • Central Tendency: The mean close price of Bitcoin over the specified period is approximately $14,758.11, indicating the average value around which the close prices tend to cluster.

  • Spread: The standard deviation of the close prices is relatively high at approximately $16,295.37, suggesting significant variability or dispersion of the close prices from the mean.

  • Minimum and Maximum: The minimum close price recorded for Bitcoin is approximately $178.10, while the maximum close price is around $67,566.83, illustrating the wide range of price levels observed during the specified period.

  • Quartiles: The 25th percentile (Q1) close price is approximately $921.74, indicating that 25% of the close prices fall below this value. Similarly, the 75th percentile (Q3) close price is approximately $24,382.68, indicating that 75% of the close prices fall below this value.

  • Median: The median close price (50th percentile or Q2) is approximately $8,285.44, representing the middle value of the close prices when arranged in ascending order.

In [7]:
fig, axes = plt.subplots(1, 3, figsize=(20, 6))

# Plot 1: Time Series
sns.lineplot(x=btc_df.index, y=btc_df['Close'], data=btc_df, ax=axes[0])
axes[0].set_xticks(btc_df.index[::6*30])
axes[0].set_xticklabels(btc_df.index[::6*30])
axes[0].set_title("Close Price Time Series")
axes[0].tick_params(axis='x', rotation=45)

# Plot 2: Boxplot
btc_df["Close"].plot(kind="box", vert=False, title="Distribution of BTC Close Prices", ax=axes[1])

# Plot 3: Distribution Plot
sns.histplot(btc_df["Close"], kde=True, color='blue', bins=30, ax=axes[2])  # distplot is deprecated in recent seaborn
axes[2].set_title("Distribution of BTC Close Prices")

plt.tight_layout()
plt.show()
In [8]:
fig, axes = plt.subplots(2, 2, figsize=(10, 10)) 

# Lag values
lags = [1, 7, 30, 90]  # Reduce to fit in 2x2 grid

for i, ax in enumerate(axes.flat):
    lag = lags[i]
    ax.scatter(x=btc_df["Close"].shift(lag), y=btc_df["Close"])
    ax.plot([btc_df["Close"].min(), btc_df["Close"].max()], [btc_df["Close"].min(), btc_df["Close"].max()], linestyle="--", color="red")
    ax.set_xlabel(f"Close (lagged by {lag} days)")
    ax.set_ylabel("Close")
    ax.set_title(f"Autocorrelation of Close Prices ({lag}-day Lag)")

plt.tight_layout()
plt.show()

The strong autocorrelation at lag 1 indicates a close relationship between the current value of the time series (here, the close price of Bitcoin) and its immediate past value. In other words, if the price of Bitcoin has been rising or falling recently, it is likely to continue in the same direction in the short term. This is commonly observed in financial time series, where short-term trends or momentum tend to persist.

On the other hand, the weak autocorrelation at higher lags suggests that there is little persistence in any seasonality or cyclical patterns in the Bitcoin price data beyond the immediate past. This implies that any recurring patterns or cycles in Bitcoin prices tend to be short-lived or irregular, making them less predictable over longer time horizons.

Overall, the strong autocorrelation at lag 1 indicates the presence of short-term predictive power in the Bitcoin price data, while the weak autocorrelation at higher lags suggests that longer-term forecasting may be more challenging due to the lack of persistent seasonality or cyclical patterns.
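This lag structure can also be checked numerically with pandas' `Series.autocorr`, which computes the lag-k Pearson autocorrelation. A minimal, self-contained sketch (using a synthetic random-walk series in place of `btc_df["Close"]`, so the snippet runs without the CSV):

```python
import numpy as np
import pandas as pd

# Synthetic random-walk series standing in for btc_df["Close"]
rng = np.random.default_rng(42)
close = pd.Series(np.cumsum(rng.normal(0, 1, 1000)) + 100)

# Lag-k autocorrelation: corr(close_t, close_{t-k})
for lag in [1, 7, 30, 90]:
    print(f"lag {lag:3d}: autocorr = {close.autocorr(lag=lag):.4f}")
```

On the real data, the same loop over `btc_df["Close"]` quantifies the drop-off seen in the scatter plots above.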


METHODOLOGY¶

  1. Data Cleaning / Preprocessing

Ensure data integrity by checking for and handling missing data.

Encode Features: Incorporate lagged values of cryptocurrency prices (since we are dealing with time series data) to capture temporal dependencies.

  2. AutoML

Establish baseline performance using Linear Regression, RandomForest, GradientBoost, and DecisionTree models without additional enhancements.

Perform a grid search over ARIMA parameters.

  3. Improving the Model

Feature Engineering: Enhance models by incorporating measures of volatility to better capture the dynamic nature of cryptocurrency markets.

Step 1: Cleaning and Preprocessing¶

Check for Null Data¶

In [9]:
null_counts = btc_df.isnull().sum()
print(null_counts)
Open         1
High         1
Low          1
Close        1
Adj Close    1
Volume       1
dtype: int64
In [10]:
null_indices = btc_df[btc_df.isnull().any(axis=1)].index
print(null_indices)
Index(['2024-01-20'], dtype='object', name='Date')

The DataFrame btc_df contains null values in all columns for the date '2024-01-20'.

In [11]:
btc_df_sorted = btc_df.sort_values(by='Date', ascending=False)
btc_df_sorted.head()
Out[11]:
Open High Low Close Adj Close Volume
Date
2024-01-21 41671.488281 41693.160156 41615.140625 41623.695313 41623.695313 1.127404e+10
2024-01-20 NaN NaN NaN NaN NaN NaN
2024-01-19 41278.460938 42134.160156 40297.457031 41618.406250 41618.406250 2.575241e+10
2024-01-18 42742.312500 42876.347656 40631.171875 41262.058594 41262.058594 2.521836e+10
2024-01-17 43132.101563 43189.890625 42189.308594 42742.652344 42742.652344 2.085123e+10
In [12]:
btc_df.drop(index=['2024-01-21', '2024-01-20'], inplace=True)

We drop the last two rows of the DataFrame btc_df, corresponding to the dates '2024-01-21' and '2024-01-20' (the latter contains only null values). This is acceptable because the removed dates are the final two entries in the dataset, so chronological order is maintained without compromising the integrity of the data.

Encode Features¶

The autocorrelation plots indicate a notable positive autocorrelation at lag 1 for both Bitcoin closing prices. This signifies that the current closing price is strongly influenced by its immediate past value, suggesting a short-term predictive power of the previous day's price on the current day's price.

To implement this insight, a lag-1 close feature can be introduced into the dataset. This involves creating a new column named "Close.L1", where each entry is the closing price from the previous day. By incorporating this lag-1 feature, the model gains valuable information about the recent trend in prices, potentially enhancing its ability to capture temporal dependencies and improve predictive accuracy.

In [13]:
df = btc_df.copy()
In [14]:
df["Close.L1"] = df["Close"].shift(1)
df.dropna(inplace = True)
df.head()
Out[14]:
Open High Low Close Adj Close Volume Close.L1
Date
2014-09-19 424.102997 427.834991 384.532013 394.795990 394.795990 37919700.0 424.440002
2014-09-20 394.673004 423.295990 389.882996 408.903992 408.903992 36863600.0 394.795990
2014-09-21 408.084991 412.425995 393.181000 398.821014 398.821014 26580100.0 408.903992
2014-09-22 399.100006 406.915985 397.130005 402.152008 402.152008 24127600.0 398.821014
2014-09-23 402.092010 441.557007 396.196991 435.790985 435.790985 45099500.0 402.152008

Step 2: AutoML¶

Split train and test sets¶

The data has been split into feature and target variables, with "Close" designated as the target variable. The feature set, denoted as X, comprises all columns except the target variable. Subsequently, the dataset has been divided into training and testing sets, with an 80-20 split ratio. The training set (X_train, y_train) consists of the initial 80% of the data, while the testing set (X_test, y_test) comprises the remaining 20% of the data. This division facilitates model training on the training set and evaluation on the unseen testing set, allowing for the assessment of model performance on new data.

In [15]:
# Split the data into feature and target
target = "Close"
y = df[target]
X = df["Close.L1"]

#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]

Calculate Baseline¶

The mean-score baseline model has been established by predicting the mean close price as the constant prediction for all instances. The mean close price for the training set is approximately $11,508.07, with a baseline Mean Absolute Error (MAE) of approximately $11,560.94. The Root Mean Squared Error (RMSE) for the baseline model, calculated using the training-set mean as the prediction for all instances in the testing set, is approximately $18,104.27.

This baseline serves as a reference point for evaluating the performance of more complex models, providing insight into the effectiveness of predictive models beyond simply predicting the mean.

In [16]:
# Mean Score Baseline
y_pred_baseline = [y_train.mean()] * len(y_train)
mae_baseline = mean_absolute_error(y_train, y_pred_baseline)
print("Mean Close Prices:", round(y_train.mean(), 2))
print("Baseline MAE:", round(mae_baseline, 2))
Mean Close Prices: 11508.07
Baseline MAE: 11560.94
In [17]:
# RMSE Baseline
baseline_prediction = np.mean(y_train)
baseline_predictions = np.full_like(y_test, baseline_prediction)
baseline_rmse = np.sqrt(mean_squared_error(y_test, baseline_predictions))

print("Baseline RMSE:", round(baseline_rmse, 2))
Baseline RMSE: 18104.27

Model Baselines¶

Next, we proceed to train and evaluate various regression models to assess their effectiveness in predicting cryptocurrency prices. By iteratively training and evaluating each model, we aim to understand each model's performance by checking metrics like Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), and runtime.

The idea is to establish baseline performance using Linear Regression, RandomForest, GradientBoost, and DecisionTree without additional enhancements. This initial step provides a fundamental benchmark for evaluating the effectiveness of more complex predictive models. By comparing the performance of these basic models, we can assess their predictive capabilities and identify areas for improvement.

In [18]:
def train_and_evaluate(model, X_train, y_train, X_test, y_test):
    start_time = time.time()
    model.fit(X_train, y_train)
    training_mae = mean_absolute_error(y_train, model.predict(X_train))
    test_mae = mean_absolute_error(y_test, model.predict(X_test))
    training_rmse = np.sqrt(mean_squared_error(y_train, model.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    runtime = time.time() - start_time
    return training_mae, test_mae, training_rmse, test_rmse, runtime


def run_baseline(model, X_train, y_train, X_test, y_test, num_tests=10):
    results = []
    for _ in range(num_tests):
        training_mae, test_mae, training_rmse, test_rmse, runtime = train_and_evaluate(model, X_train, y_train, X_test, y_test)
        results.append({'Model Name': type(model).__name__, 'MAE': test_mae, 'RMSE': test_rmse, 'Runtime': runtime})
    return results


# Initialize models
models = [
    RandomForestRegressor(),
    LinearRegression(),
    GradientBoostingRegressor(),
    DecisionTreeRegressor()
]
results_all_models = []

# Train and evaluate each model
for model in models:
    results = run_baseline(model, X_train.values.reshape(-1, 1), y_train, X_test.values.reshape(-1, 1), y_test)
    results_all_models.extend(results)

# Convert results to DataFrame
df_results = pd.DataFrame(results_all_models)

# Compute averages
avg_results = df_results.groupby('Model Name').mean()

avg_results
Out[18]:
MAE RMSE Runtime
Model Name
DecisionTreeRegressor 1361.309005 1797.699143 0.010311
GradientBoostingRegressor 1030.262711 1315.233570 0.219155
LinearRegression 515.438167 792.148102 0.003849
RandomForestRegressor 1108.432802 1453.098120 0.626320
In [19]:
# # Function to perform ARIMA grid search
# def arima_grid_search(train_data, test_data, p_values, d_values, q_values):
#     results = []
#     for p, d, q in product(p_values, d_values, q_values):
#         try:
#             history = [x for x in train_data]
#             predictions = []
#             # walk-forward validation
#             for t in range(len(test_data)):
#                 model = ARIMA(history, order=(p, d, q))
#                 model_fit = model.fit()
#                 output = model_fit.forecast()
#                 yhat = output[0]
#                 predictions.append(yhat)
#                 obs = test_data[t]
#                 history.append(obs)

#             # evaluate forecasts
#             mae = mean_absolute_error(test_data, predictions)
#             rmse = sqrt(mean_squared_error(test_data, predictions))
#             results.append((p, d, q, mae, rmse))
#         except Exception:
#             continue

#     return pd.DataFrame(results, columns=['p', 'd', 'q', 'MAE', 'RMSE'])


# # Define ARIMA parameters for grid search
# p_values = range(0, 6)  # p values
# d_values = range(0, 2)  # d values
# q_values = range(0, 2)  # q values

# # Perform grid search
# grid_results = arima_grid_search(y_train.values, y_test.values, p_values, d_values, q_values)

# # Output results to DataFrame
# grid_results

Grid Search: ARIMA¶

Next, we perform a grid search to fine-tune the parameters of the ARIMA model. This process involves systematically evaluating different combinations of model parameters to identify the optimal configuration that yields the best performance. By leveraging grid search, we aim to enhance the predictive accuracy of the ARIMA model and improve its suitability for forecasting cryptocurrency prices.

p d q MAE RMSE
0 0 0 14632.688819 16611.304597
0 0 1 7445.398557 8543.820176
0 1 0 514.586990 791.797907
0 1 1 513.378288 791.093285
1 0 0 514.354121 791.915688
1 0 1 513.151961 791.228830
1 1 0 513.364870 791.057282
1 1 1 513.407777 791.106320
2 0 0 513.148681 791.200314
2 0 1 513.344987 791.352766
2 1 0 513.550905 791.103317
2 1 1 514.060564 791.342664
3 0 0 513.338597 791.221013
3 0 1 513.309885 791.278221
3 1 0 514.198221 790.639340
3 1 1 515.883490 790.754452
4 0 0 513.948599 790.733309
4 0 1 514.188689 791.002517
4 1 0 516.273217 791.916986
4 1 1 516.826047 792.005012
5 0 0 516.316086 792.012733
5 0 1 516.121262 791.976760
5 1 0 516.749558 792.156649
5 1 1 516.930870 792.185792
In [20]:
# grid_results_sorted = grid_results.sort_values(by=['MAE', 'RMSE'])

# grid_results_sorted.head(1)
p d q MAE RMSE
2 0 0 513.148681 791.200314

Based on the grid search we get the best parameters:

  • p (AR order): The value of 2 indicates that the model includes two lagged observations of the dependent variable (time series) as predictors. This means that the current value of the time series is modeled as a linear combination of its two most recent observations.

  • d (I order): The value of 0 implies that no differencing is required to make the time series stationary. In other words, the original time series is stationary or does not exhibit any trend or seasonality that needs to be removed through differencing.

  • q (MA order): The value of 0 suggests that the model does not include any lagged forecast errors (residuals) of the dependent variable as predictors. Therefore, the model does not explicitly capture any short-term dependencies beyond the lagged observations.

Given these parameter values, the ARIMA(2, 0, 0) model can be interpreted as follows:

The current value of the time series is linearly dependent on its two most recent observations (AR component). No differencing is required to make the time series stationary (I component), implying that the original time series already exhibits stationarity. The model does not incorporate any additional short-term dependencies beyond the lagged observations (MA component).
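In equation form, the fitted ARIMA(2, 0, 0) is simply an AR(2) model (a sketch: c is an intercept, φ₁ and φ₂ the autoregressive coefficients, and ε_t white-noise error):

```latex
\text{Close}_t = c + \phi_1\,\text{Close}_{t-1} + \phi_2\,\text{Close}_{t-2} + \varepsilon_t
```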

In [21]:
history = [x for x in y_train]
predictions = []

for t in range(len(y_test)):
    model = ARIMA(history, order=(2, 0, 0))
    model_fit = model.fit()
    output = model_fit.forecast()
    yhat = output[0]
    predictions.append(yhat)
    obs = y_test.iloc[t]
    history.append(obs)
In [22]:
df_pred_test = pd.DataFrame(
             {
                 "y_test": y_test,
                 "y_pred": predictions
             }, index=y_test.index
)

df_last_90 = df_pred_test.iloc[-90:]

# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "Close Price"}, title="ARIMA(p=2, d=0, q=0) Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()

Determining the Best Model¶

The ARIMA model with parameters (p=2, d=0, q=0) achieved an MAE (Mean Absolute Error) of approximately 513.15 and an RMSE (Root Mean Squared Error) of approximately 791.20. On the other hand, the Linear Regression model achieved a slightly higher MAE of approximately 515.44 and a comparable RMSE of approximately 792.15.

Comparing the performance of these two models, we observe that the ARIMA model slightly outperformed the Linear Regression model in terms of MAE, indicating that it made slightly more accurate predictions. However, the difference in performance between the two models is very minimal, suggesting that both models are relatively similar in their predictive capabilities for this particular dataset.

The average runtime for the Linear Regression model is approximately 0.004 seconds, whereas the ARIMA walk-forward evaluation took approximately 1 minute and 24.65 seconds to train and evaluate.

Given that no differencing is required to make the time series stationary, it implies that the original time series already exhibits stationarity. In such cases, Linear Regression may indeed be the preferred choice.

Linear Regression offers simplicity, interpretability, and computational efficiency, making it suitable for scenarios where the time series data does not exhibit complex temporal dependencies or nonlinear patterns. Additionally, the shorter runtime of Linear Regression compared to ARIMA further supports its practical applicability in scenarios where computational resources or runtime constraints are a concern.

In [23]:
df = btc_df.copy()

df["Close.L1"] = df["Close"].shift(1)
df.dropna(inplace = True)

# Split the data into feature and target
target = "Close"
y = df[target]
X = df[["Close.L1"]]

#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]


model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 515.4381667582289
In [24]:
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)

# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
Out[24]:
Close.L1
Coefficient 0.99926
In [25]:
df_pred_test = pd.DataFrame(
             {
             "y_test": y_test,
             "y_pred": predictions
             }
)

df_last_90 = df_pred_test.iloc[-90:]

# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "BTC Close Price"}, title="Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()

Step 3: Improving the Model¶

Feature Engineering¶

In an effort to enhance the predictive capabilities of our model, we've introduced additional features derived from the cryptocurrency price data. Specifically, we've calculated volatility measures at daily, weekly, monthly, and yearly intervals using rolling window standard deviations of the closing prices. These volatility measures capture the degree of price fluctuation within each respective time frame.

Additionally, we've incorporated lagged versions of the closing prices and volatility measures into the dataset. By shifting these features back by one and two time steps, we provide the model with historical information, enabling it to capture temporal dependencies and patterns in the data.

Furthermore, to ensure that our dataset remains consistent after introducing these new features, we've removed any rows containing missing values resulting from the lag operations.

In [26]:
df = btc_df.copy()

df['Volatility_Daily'] = df['Close'].rolling(window=2, min_periods=1).std()
df['Volatility_Weekly'] = df['Close'].rolling(window=7).std()
df['Volatility_Monthly'] = df['Close'].rolling(window=30).std()
df['Volatility_Yearly'] = df['Close'].rolling(window=365).std()
df["Close.L1"] = df["Close"].shift(1)
df["Close.L2"] = df["Close"].shift(2)


df['Volatility_Daily.L1'] = df['Volatility_Daily'].shift(1)
df['Volatility_Daily.L2'] = df['Volatility_Daily'].shift(2)

df['Volatility_Weekly.L1'] = df['Volatility_Weekly'].shift(1)
df['Volatility_Weekly.L2'] = df['Volatility_Weekly'].shift(2)

df['Volatility_Monthly.L1'] = df['Volatility_Monthly'].shift(1)
df['Volatility_Monthly.L2'] = df['Volatility_Monthly'].shift(2)

df['Volatility_Yearly.L1'] = df['Volatility_Yearly'].shift(1)
df['Volatility_Yearly.L2'] = df['Volatility_Yearly'].shift(2)


df.dropna(inplace = True)
df.head()
Out[26]:
Open High Low Close Adj Close Volume Volatility_Daily Volatility_Weekly Volatility_Monthly Volatility_Yearly Close.L1 Close.L2 Volatility_Daily.L1 Volatility_Daily.L2 Volatility_Weekly.L1 Volatility_Weekly.L2 Volatility_Monthly.L1 Volatility_Monthly.L2 Volatility_Yearly.L1 Volatility_Yearly.L2
Date
2015-09-19 232.858002 233.205002 231.089005 231.492996 231.492996 12712600.0 1.047939 1.250336 6.346485 57.053890 232.975006 229.809998 2.237999 0.508406 2.134827 3.998461 6.393035 6.437738 57.309662 57.745595
2015-09-20 231.399002 232.365005 230.910004 231.212006 231.212006 14444700.0 0.198690 1.261682 6.340692 56.710777 231.492996 232.975006 1.047939 2.237999 1.250336 2.134827 6.346485 6.393035 57.053890 57.309662
2015-09-21 231.216995 231.216995 226.520996 227.085007 227.085007 19678800.0 2.918229 1.890600 6.381719 56.432606 231.212006 231.492996 0.198690 1.047939 1.261682 1.250336 6.340692 6.346485 56.710777 57.053890
2015-09-22 226.968994 232.386002 225.117004 230.617996 230.617996 25009300.0 2.498200 1.894945 6.360249 56.120629 227.085007 231.212006 2.918229 0.198690 1.890600 1.261682 6.381719 6.340692 56.432606 56.710777
2015-09-23 230.936005 231.835007 229.591003 230.283005 230.283005 17254100.0 0.236874 1.817409 5.044624 55.570408 230.617996 227.085007 2.498200 2.918229 1.894945 1.890600 6.360249 6.381719 56.120629 56.432606
In [27]:
# Split the data into feature and target
target = "Close"
y = df[target]
X = df[["Close.L1", "Close.L2",
        "Volatility_Daily.L1",
        "Volatility_Weekly.L1",
        "Volatility_Monthly.L1",
        "Volatility_Yearly.L1"]]

#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]


model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 458.71113525341354
In [28]:
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)


# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
Out[28]:
Close.L1 Close.L2 Volatility_Daily.L1 Volatility_Weekly.L1 Volatility_Monthly.L1 Volatility_Yearly.L1
Coefficient 0.967635 0.027756 0.067707 0.008924 0.020558 0.000455
In [29]:
df_pred_test = pd.DataFrame(
             {
             "y_test": y_test,
             "y_pred": predictions
             }
)
df_pred_test.head(10)
Out[29]:
y_test y_pred
Date
2022-05-21 29432.226563 29280.383197
2022-05-22 30323.722656 29431.059319
2022-05-23 29098.910156 30328.908085
2022-05-24 29655.585938 29184.468628
2022-05-25 29562.361328 29655.061798
2022-05-26 29267.224609 29554.169498
2022-05-27 28627.574219 29274.018843
2022-05-28 28814.900391 28662.201573
2022-05-29 29445.957031 28800.726745
2022-05-30 31726.390625 29432.724339
In [30]:
df_last_90 = df_pred_test.iloc[-90:]

# Create the line plot with Plotly
fig = px.line(df_last_90, labels={"value": "BTC Close Price"}, title="Improved Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()

The Mean Absolute Error (MAE) has improved from approximately 515.44 to 458.71 after incorporating the additional features and lagged variables into the model. This reduction in MAE indicates that the model's predictive accuracy has improved, suggesting that the introduced enhancements have effectively captured additional information from the data.

We also build the same model for Ethereum (ETH) using the same features and parameters employed for the Bitcoin (BTC) data. This involves preprocessing the Ethereum dataset to incorporate additional features such as volatility measures and lagged variables.

In [31]:
eth_df = pd.read_csv('ETH-USD (2017-2024).csv')
eth_df.set_index('Date', inplace=True)

df = eth_df.copy()

df['Volatility_Daily'] = df['Close'].rolling(window=2, min_periods=1).std()
df['Volatility_Weekly'] = df['Close'].rolling(window=7).std()
df['Volatility_Monthly'] = df['Close'].rolling(window=30).std()
df['Volatility_Yearly'] = df['Close'].rolling(window=365).std()
df["Close.L1"] = df["Close"].shift(1)
df["Close.L2"] = df["Close"].shift(2)


df['Volatility_Daily.L1'] = df['Volatility_Daily'].shift(1)
df['Volatility_Daily.L2'] = df['Volatility_Daily'].shift(2)

df['Volatility_Weekly.L1'] = df['Volatility_Weekly'].shift(1)
df['Volatility_Weekly.L2'] = df['Volatility_Weekly'].shift(2)

df['Volatility_Monthly.L1'] = df['Volatility_Monthly'].shift(1)
df['Volatility_Monthly.L2'] = df['Volatility_Monthly'].shift(2)

df['Volatility_Yearly.L1'] = df['Volatility_Yearly'].shift(1)
df['Volatility_Yearly.L2'] = df['Volatility_Yearly'].shift(2)


df.dropna(inplace = True)
df.head()
Out[31]:
Open High Low Close Adj Close Volume Volatility_Daily Volatility_Weekly Volatility_Monthly Volatility_Yearly Close.L1 Close.L2 Volatility_Daily.L1 Volatility_Daily.L2 Volatility_Weekly.L1 Volatility_Weekly.L2 Volatility_Monthly.L1 Volatility_Monthly.L2 Volatility_Yearly.L1 Volatility_Yearly.L2
Date
2018-11-11 212.479004 212.998993 208.867996 211.339996 211.339996 1.501600e+09 0.843585 3.526736 5.753073 270.164345 212.533005 210.074005 1.738776 1.525228 4.083468 6.168682 5.839789 6.283667 269.871540 269.619064
2018-11-12 211.513000 212.623001 208.923996 210.417999 210.417999 1.452380e+09 0.651950 3.311558 5.732665 270.443820 211.339996 212.533005 0.843585 1.738776 3.526736 4.083468 5.753073 5.839789 270.164345 269.871540
2018-11-13 210.149002 210.514999 206.134995 206.826004 206.826004 1.610260e+09 2.539924 3.135081 5.420623 270.755271 210.417999 211.339996 0.651950 0.843585 3.311558 3.526736 5.732665 5.753073 270.443820 270.164345
2018-11-14 206.533997 207.044998 174.084000 181.397003 181.397003 2.595330e+09 17.981019 11.187730 6.989767 271.200449 206.826004 210.417999 2.539924 0.651950 3.135081 3.311558 5.420623 5.732665 270.755271 270.443820
2018-11-15 181.899002 184.251007 170.188995 180.806000 180.806000 2.638410e+09 0.417902 14.324449 8.201139 271.637530 181.397003 206.826004 17.981019 2.539924 11.187730 3.135081 6.989767 5.420623 271.200449 270.755271
In [32]:
# Split the data into features and target
target = "Close"
y = df[target]
X = df[["Close.L1", "Close.L2",
        "Volatility_Daily.L1",
        "Volatility_Weekly.L1",
        "Volatility_Monthly.L1",
        "Volatility_Yearly.L1"]]

#Split the data into train and test sets
cutoff = int(len(X) * 0.8)
X_train, y_train = X.iloc[:cutoff], y.iloc[:cutoff]
X_test, y_test = X.iloc[cutoff:], y.iloc[cutoff:]


model = LinearRegression()
model.fit(X_train, y_train)

predictions = model.predict(X_test)

test_mae = mean_absolute_error(y_test, predictions)
print("MAE:", test_mae)
MAE: 32.55243778097503

The Mean Absolute Error (MAE) for Ethereum (ETH) is a low 32.55, indicating that the model's predictions are, on average, close to the actual observed values. However, this figure is only meaningful relative to the scale and variability of the Ethereum price data, summarized by `eth_df.describe()` below.
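To put the MAE on a comparable scale across coins, it helps to express it as a fraction of the average price. A minimal sketch, reusing the MAE and mean Close price reported in this notebook:

```python
# Express MAE as a fraction of the mean Close price (values from this notebook).
eth_mae = 32.55           # test-set MAE reported above
eth_mean_close = 1248.97  # mean Close from eth_df.describe()

relative_mae = eth_mae / eth_mean_close
print(f"ETH relative MAE: {relative_mae:.2%}")
```

By this measure the error is roughly 2.6% of the average price, modest in absolute terms but not negligible given crypto's daily swings.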

In [33]:
eth_df.describe()
Out[33]:
Open High Low Close Adj Close Volume
count 2263.000000 2263.000000 2263.000000 2263.000000 2263.000000 2.263000e+03
mean 1248.213140 1283.972388 1208.851543 1248.970441 1248.970441 1.205243e+10
std 1118.835543 1150.922648 1082.560829 1118.566081 1118.566081 1.012443e+10
min 84.279694 85.342743 82.829887 84.308296 84.308296 6.217330e+08
25% 231.636727 236.766563 227.149369 231.901916 231.901916 4.845689e+09
50% 1038.186646 1090.229980 956.325012 1039.099976 1039.099976 9.401190e+09
75% 1870.983582 1905.373352 1844.880860 1871.952942 1871.952942 1.657259e+10
max 4810.071289 4891.704590 4718.039063 4812.087402 4812.087402 8.448291e+10
In [34]:
# Accessing the coefficients
coefficients = model.coef_
coefficients = coefficients.reshape(1, -1)


# Create a DataFrame to display the coefficients
coefficients_df = pd.DataFrame(coefficients, columns=X_train.columns, index=['Coefficient'])
coefficients_df
Out[34]:
Close.L1 Close.L2 Volatility_Daily.L1 Volatility_Weekly.L1 Volatility_Monthly.L1 Volatility_Yearly.L1
Coefficient 0.930871 0.063868 0.003381 0.05042 -0.015873 0.007528
In [35]:
df_pred_test = pd.DataFrame(
             {
             "y_test": y_test,
             "y_pred": predictions
             }
)
df_pred_test.tail(10)
Out[35]:
y_test y_pred
Date
2024-01-10 2582.103516 2337.871501
2024-01-11 2619.619141 2563.255347
2024-01-12 2524.460205 2614.551210
2024-01-13 2576.597900 2528.455932
2024-01-14 2472.241211 2570.218343
2024-01-15 2511.363770 2474.528096
2024-01-16 2587.691162 2502.872089
2024-01-17 2528.369385 2574.448874
2024-01-18 2467.018799 2523.999140
2024-01-19 2489.498535 2462.939432
In [36]:
df_last_90 = df_pred_test.iloc[-90:]

# Plot the last 90 actual vs. predicted closes with Plotly
fig = px.line(df_last_90, labels={"value": "ETH Close Price"}, title="Linear Regression Model: Actual Prices vs. Predicted Prices (Last 90 Data Points)")
fig.show()

RESULTS AND DISCUSSION¶

In machine learning, a more complex model does not guarantee better performance; simple models like linear regression deserve serious consideration, as they can outperform more intricate ones. Our analysis supports this notion: Linear Regression emerged as the best-performing model, ahead of more complex alternatives like ARIMA. Tree-based regression methods performed reasonably, though their limited effectiveness for time series forecasting was expected, since they cannot extrapolate beyond the range of their training data.

Bitcoin and Ethereum are characterized by significant price fluctuations (volatility), where recent price movements notably impact the next price. Over time, the impact of past patterns decreases, suggesting that seasonal or cyclical trends don't persist. This observation is supported by the autocorrelation analysis, highlighting the challenges in forecasting cryptocurrency prices accurately due to their dynamic and non-persistent nature.
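The decaying autocorrelation described above can be illustrated with a short sketch; the random-walk series here is synthetic and stands in for the actual Close prices:

```python
import numpy as np
import pandas as pd

# Synthetic random-walk "prices" stand in for the Close series.
rng = np.random.default_rng(42)
close = pd.Series(100 + rng.normal(0, 1, 500).cumsum())

# Autocorrelation is strong at lag 1 and weakens at longer lags,
# matching the pattern observed for BTC and ETH.
for lag in (1, 7, 30):
    print(f"lag {lag:>2}: {close.autocorr(lag=lag):.3f}")
```

The high lag-1 value is why the lagged Close features dominate the fitted coefficients, while the fading longer-lag values explain why seasonal patterns contribute little.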

While a model may display a low Mean Absolute Error (MAE), this metric alone does not capture its practical effectiveness. Even with a low MAE, the model fails to predict the volatile price swings that matter most when investing in crypto; this is evident in the plots, where the predictions visibly lag behind the actual trend. For instance, the MAE for Bitcoin was 458.71, whereas for Ethereum it was 32.55. Despite these low values, the model may still struggle to predict extreme price movements accurately.
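One way to quantify the lag effect is to compare the model against a naive persistence baseline that simply predicts each day's Close as the previous day's Close. A hedged sketch on synthetic data (the series and numbers here are illustrative, not from the notebook):

```python
import numpy as np

# Persistence baseline: predict each day's Close as the previous day's Close.
rng = np.random.default_rng(0)
close = 2500 + rng.normal(0, 40, 200).cumsum()  # synthetic ETH-like series

actual = close[1:]       # today's price
naive_pred = close[:-1]  # yesterday's price
naive_mae = np.abs(actual - naive_pred).mean()
print(f"Persistence-baseline MAE: {naive_mae:.2f}")
```

If a fitted model's MAE is close to this baseline's, its apparent accuracy largely reflects copying the previous price rather than anticipating moves.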

However, these models may not suffice, for several reasons. First, cryptocurrency markets are highly volatile, with prices experiencing rapid and unpredictable fluctuations that ARIMA and linear regression struggle to capture and forecast. Second, cryptocurrency price series often exhibit non-stationary behavior: their statistical properties change over time, while traditional time series models like ARIMA assume stationarity (or require differencing to achieve it), which can lead to inaccurate forecasts. Finally, cryptocurrency prices are influenced by a wide range of factors, including investor sentiment, market speculation, technological developments, regulatory changes, and macroeconomic events, interactions that these models do not capture.
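The non-stationarity point can be made concrete: a random walk's level drifts over time, while its first differences are roughly stationary, which is exactly the differencing that ARIMA's "I" term applies. A sketch on synthetic data:

```python
import numpy as np
import pandas as pd

# A random-walk "price" series and its first differences.
rng = np.random.default_rng(1)
prices = pd.Series(1000 + rng.normal(0, 10, 1000).cumsum())
diffs = prices.diff().dropna()

# The level series wanders far from its start (non-stationary);
# the differenced series stays centered near zero with stable spread.
print(f"price std: {prices.std():.1f}")
print(f"diff std:  {diffs.std():.1f}")
```

The much larger spread of the level series is the signature of non-stationarity that makes direct price modeling fragile.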

To address these limitations, we could explore finer-grained data (e.g., hourly or minute-level prices) and incorporate fundamental factors as additional features in the model. By considering a broader range of variables and refining the granularity of the data, we can potentially improve the accuracy and robustness of cryptocurrency price forecasts.
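As a sketch of that direction, the feature-engineering pattern used above transfers directly to finer-grained data; the hourly index and the "Sentiment" column below are hypothetical placeholders, not data used in this study:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly Close series plus a placeholder exogenous feature.
rng = np.random.default_rng(2)
idx = pd.date_range("2024-01-01", periods=24 * 7, freq="1h")
hourly = pd.DataFrame({
    "Close": 2500 + rng.normal(0, 5, len(idx)).cumsum(),
    "Sentiment": rng.uniform(-1, 1, len(idx)),  # made-up fundamental factor
}, index=idx)

# Same lagging pattern as the daily models above, now at hourly resolution.
hourly["Close.L1"] = hourly["Close"].shift(1)
features = hourly.dropna()
print(features.shape)
```

The rolling-volatility and lag construction from the daily pipeline would apply unchanged on this index; only the window sizes need rescaling (e.g., a 24-row window for daily volatility).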


SOURCES¶

  • https://medium.com/@sawsanyusuf/linear-regression-with-time-series-data-9186eb1ee607
  • https://www.futurelearn.com/info/courses/advanced-data-mining-with-weka/0/steps/29456
  • https://machinelearningmastery.com/arima-for-time-series-forecasting-with-python/